Identifying Character Encodings on the Internet
In many Internet protocols, a charset parameter may be used in certain contexts to specify both a character set and a character encoding scheme. The value of the charset parameter is a case-insensitive string limited to the characters A-Z, a-z, 0-9, hyphen-minus, underscore, period, and colon. The character encoding names specified for this parameter are generally expressed in US-ASCII octet values.
The character encoding name may be an experimental name beginning with x-; if it is not an experimental name, it must be a name registered with the Internet Assigned Numbers Authority (IANA) that corresponds to a character encoding that has a formal specification. Multiple names exist for most character encodings in the registry. The IANA registry is updated periodically; for example, the name EUC-JP was added to it in January. Table C-1 (page 243) identifies character encodings for various languages, gives some of their common Internet names, and tells when the character encoding was first supported for the Text Encoding Converter and the Unicode Converter. To preview the style of character set name used on the Internet, here are a few sample names:
ISO-8859-1 latin1 UNICODE-1-1-UTF-7 Shift_JIS X-EUC-CN
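The naming rules above (the permitted characters, case-insensitive comparison, and the x- prefix for experimental names) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names is_valid_charset_name, is_experimental, and same_charset_name are hypothetical helpers invented for this sketch, not part of any converter API, and the check covers only the character rules, not whether a name is actually registered with IANA.

```python
import re

# Characters permitted in a charset value, as listed in the text:
# A-Z, a-z, 0-9, hyphen-minus, underscore, period, and colon.
_CHARSET_NAME = re.compile(r"[A-Za-z0-9_.:-]+")


def is_valid_charset_name(name: str) -> bool:
    """Return True if name uses only the permitted characters (hypothetical helper)."""
    return _CHARSET_NAME.fullmatch(name) is not None


def is_experimental(name: str) -> bool:
    """Return True if name carries the experimental x- prefix, in either case."""
    return name.lower().startswith("x-")


def same_charset_name(a: str, b: str) -> bool:
    """Compare two charset names case-insensitively."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for sample in ["ISO-8859-1", "latin1", "UNICODE-1-1-UTF-7", "Shift_JIS", "X-EUC-CN"]:
        print(sample, is_valid_charset_name(sample), is_experimental(sample))
```

Of the sample names, only X-EUC-CN carries the experimental prefix. A name that passes this check may still be unregistered.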
Many of the character encodings in use on the Internet are not registered with IANA and do not have official Internet names, although they may have names that have become de facto standards. Moreover, even when an encoding is registered, the name specified by IANA may not be the one that is actually used on the Internet. For example, EUC-JP has been registered for some time with the unwieldy name Extended_UNIX_Code_Packed_Format_for_Japanese, but the name actually used is the unofficial X-EUC-JP. As another example, Shift_JIS is the official name, but the names commonly used in its stead are x-shift-jis and x-sjis. In many cases, mail and browser software recognizes only the unofficial names, not the official ones.
In some cases, the names for unregistered encodings follow a pattern established by other, registered encodings. For example, some IBM/Microsoft code pages are registered with names consisting of
cp followed by the code page number: cp437, cp850, cp852. Code page 874 is not registered, but the name cp874 would be expected. Most Windows code pages are registered using the form used in these examples: windows-1250, windows-1251. Windows Latin-1 is, oddly enough, not registered as either windows-1252 or cp1252, although both forms are in use.
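Because official and unofficial names coexist, software that reads a charset value often needs an alias table and some pattern matching. The following Python sketch shows one possible approach. The table ALIASES and the functions canonical_name and code_page_number are hypothetical names introduced for illustration, and the table holds only the names mentioned in this section. Mapping toward the registered name is also an assumption of this sketch; since some mail and browser software recognizes only the unofficial names, a table used for outgoing text might need to map in the other direction.

```python
import re
from typing import Optional

# Hypothetical alias table built only from names mentioned in the text.
# Keys are lowercased because charset names are case-insensitive.
ALIASES = {
    "euc-jp": "EUC-JP",
    "x-euc-jp": "EUC-JP",
    "extended_unix_code_packed_format_for_japanese": "EUC-JP",
    "shift_jis": "Shift_JIS",
    "x-shift-jis": "Shift_JIS",
    "x-sjis": "Shift_JIS",
}

# Matches the two code page naming patterns described in the text:
# cp<number> (for example cp437) and windows-<number> (for example windows-1250).
_CODE_PAGE = re.compile(r"(?:cp|windows-)([0-9]+)")


def canonical_name(name: str) -> str:
    """Map a known alias to one preferred name; return unknown names unchanged."""
    return ALIASES.get(name.lower(), name)


def code_page_number(name: str) -> Optional[int]:
    """Extract the code page number from a cp or windows- style name, if any."""
    match = _CODE_PAGE.fullmatch(name.lower())
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(canonical_name("x-sjis"))
    print(code_page_number("cp874"))
    print(code_page_number("windows-1252"))
```

Matching on the pattern rather than on a fixed list lets unregistered but predictable names such as cp874 and cp1252 be recognized alongside the registered ones.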